Overview

  • Reference genomes and GRC.
  • FASTA and FASTQ (Unaligned sequences).
  • SAM/BAM (Aligned sequences).
  • BED (Genomic intervals).
  • GFF/GTF (Gene annotation).
  • Wiggle files, bedGraphs and bigWigs (Genomic scores).
  • VCF and MAF (Genomic variations).

Are we there yet?

  • The human genome isn’t complete!
  • In fact, the reference genomes of most model organisms are regularly updated.
  • A reference genome is a mixture of known chromosomes and unplaced contigs, together called a genome reference assembly.
  • Major revisions to an assembly change the coordinates of genomic features.
  • Coordinates therefore have to be converted between revisions.
  • The latest GRC genome assembly for humans is GRCh38.
  • Patches add information to the assembly without disrupting the chromosome coordinates, e.g. GRCh38.p3.

Why do we need to know about reference genomes?

  • Allows for genes and genomic features to be evaluated in their linear genomic context.
    • Gene A is close to Gene B
    • Gene A and Gene B are within feature C.
  • Can be used to align shallow, targeted high-throughput sequencing reads to a pre-built map of an organism’s genome.

A reference genome

  • A reference genome is a collection of contigs.
  • A contig is a stretch of DNA sequence encoded as A,G,C,T,N.
  • Typically comes in FASTA format.
    • “>” line contains information on contig
    • Lines following contain contig sequence
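
The layout above (a “>” line followed by sequence lines) can be read with a few lines of Python. This is a minimal sketch rather than a full parser: the function name `read_fasta` and the two contigs in the example are made up for illustration, and it assumes the contig name is the first word after “>”.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect contig sequences keyed by the first word of each '>' line."""
    contigs: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Assumption: the contig name is the first word of the header.
            name = line[1:].split()[0]
            contigs[name] = []
        elif name is not None:
            # Sequence may be wrapped over several lines; join them later.
            contigs[name].append(line.upper())
    return {contig: "".join(parts) for contig, parts in contigs.items()}


# Hypothetical two-contig reference, for illustration only.
example = """>chr1 hypothetical contig
ACGTNNAC
GGTA
>chrUn_0001 hypothetical unplaced contig
TTGACA
""".splitlines()

genome = read_fasta(example)
for contig, seq in genome.items():
    print(contig, len(seq))
```

Real reference genomes are large, so in practice an indexed reader is usually preferable to loading every contig into memory.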

High-throughput Sequencing formats

  • Unaligned sequence files generated from HTS machines are mapped to a reference genome to produce aligned sequence files.
    • FASTQ - Unaligned sequences
    • SAM - Aligned sequences

Unaligned Sequences

FASTQ - Header

  • Header for each read can contain additional information
    • HS2000-887_89 - Machine name.
    • 5 - Flowcell lane.
    • /1 - Read 1 or 2 of pair (here read 1)
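
As a rough illustration of the header fields listed above, the sketch below splits a read header into machine name, lane and read-in-pair. The full header string is an assumption: only `HS2000-887_89`, `5` and `/1` come from the slide, while the colon-separated layout and the remaining numbers are made up to resemble an older Illumina-style header. The function name `parse_fastq_header` is likewise hypothetical.

```python
def parse_fastq_header(header: str) -> dict:
    """Split an older Illumina-style read header into its parts.

    Assumed layout: @<machine>:<lane>:<other fields...>/<read in pair>
    """
    name, _, pair = header.lstrip("@").partition("/")
    fields = name.split(":")
    return {
        "machine": fields[0],
        "lane": int(fields[1]),
        # No "/1" or "/2" suffix means the pair member is unknown.
        "read_in_pair": int(pair) if pair else None,
    }


# Illustrative header; the values after the lane are invented.
header = "@HS2000-887_89:5:1101:1408:2208/1"
info = parse_fastq_header(header)
print(info)
```

Header conventions differ between instruments and software versions, so a real pipeline should check which layout its files use before relying on positional fields.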

Aligned sequences

SAM format

  • SAM - Sequence Alignment Map.
  • Standard format for aligned sequence data.
  • Recognised by the majority of software and browsers.

Aligned sequences

SAM - Aligned reads

  • Each record contains the read, its alignment information and its location.

  • Read name.
  • Sequence of read.
  • Encoded sequence quality.
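
To make the fields above concrete, here is a small sketch that splits one SAM alignment line into its 11 mandatory columns and decodes the quality string. The read in the example is invented, the helper names are hypothetical, and the quality decoding assumes the Phred+33 encoding used by SAM.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line: str) -> dict:
    """Split one tab-separated alignment line into the mandatory SAM fields."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, values[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = values[11:]  # optional fields, left unparsed here
    return record


def phred_scores(qual: str, offset: int = 33) -> list:
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in qual]


# Invented alignment line, for illustration only.
line = "read001\t0\tchr1\t1000\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIHHHH"
rec = parse_sam_line(line)
print(rec["QNAME"], rec["RNAME"], rec["POS"], rec["SEQ"])
print(phred_scores(rec["QUAL"]))
```

Header lines in a SAM file start with “@” and would need to be skipped before calling a parser like this.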

  • Chromosome to which read aligns.
  • Leftmost position in the chromosome to which the read aligns (1-based).
  • Alignment information - “CIGAR string”.
    • 100M - Continuous alignment of 100 bases (matches or mismatches).
    • 28M1D72M - 28 bases aligned, 1 base deleted relative to the reference, then 72 bases aligned.
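
The two CIGAR strings above can be unpacked programmatically. The following is a sketch, with hypothetical helper names, that splits a CIGAR string into (length, operation) pairs and works out how many bases of the read and of the reference the alignment covers.

```python
import re

# One CIGAR element is a run length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance along the read and along the reference.
CONSUMES_READ = set("MIS=X")
CONSUMES_REF = set("MDN=X")


def parse_cigar(cigar: str) -> list:
    """Return the CIGAR string as a list of (length, operation) tuples."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def cigar_lengths(cigar: str) -> tuple:
    """Return (bases of read used, bases of reference spanned)."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in CONSUMES_READ)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REF)
    return read_len, ref_len


for cigar in ("100M", "28M1D72M"):
    print(cigar, parse_cigar(cigar), cigar_lengths(cigar))
```

For 28M1D72M the read is still 100 bases long, but it spans 101 bases of the reference because of the deleted base.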

Summarised Genomic Features formats.


Summarising in genomic intervals.

BED format (BED)

  • Simple format.
  • 3 tab-separated columns.
  • Chromosome, start, end.

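
A three-column BED line is simple enough to parse by hand. The sketch below uses invented intervals and a hypothetical `parse_bed3` helper; it relies on the BED convention that start is 0-based and end is exclusive, so the interval length is simply end minus start.

```python
def parse_bed3(line: str) -> tuple:
    """Return (chromosome, start, end) from a tab-separated BED line."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


# Invented intervals, for illustration only.
bed_lines = ["chr1\t100\t200", "chr2\t0\t50"]
intervals = [parse_bed3(line) for line in bed_lines]

for chrom, start, end in intervals:
    # BED start is 0-based and end is exclusive: length = end - start.
    print(chrom, start, end, end - start)
```

The 0-based start is worth remembering when comparing BED intervals with 1-based formats such as SAM or GFF.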
Summarising in genomic intervals.

narrowPeak and broadPeak

  • narrowPeak and broadPeak are extensions of BED6 used in ENCODE’s peak calling.
  • They add a signal value, p-value and q-value for each peak.
  • narrowPeak - BED6+4 (also records the offset of the peak summit).
  • broadPeak - BED6+3.
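
As an illustration of the “BED6+4” layout, the sketch below names the ten narrowPeak columns and converts them to numbers. The helper name `parse_narrowpeak` is hypothetical and the example peak is illustrative. In this format the p-value and q-value columns hold -log10 values, with -1 meaning “not assigned”.

```python
NARROWPEAK_COLUMNS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",  # BED6
    "signalValue", "pValue", "qValue", "peak",                     # +4
]
INT_COLUMNS = ("chromStart", "chromEnd", "score", "peak")
FLOAT_COLUMNS = ("signalValue", "pValue", "qValue")


def parse_narrowpeak(line: str) -> dict:
    """Parse one narrowPeak line into a dictionary of typed values."""
    record = dict(zip(NARROWPEAK_COLUMNS, line.rstrip("\n").split("\t")))
    for key in INT_COLUMNS:
        record[key] = int(record[key])
    for key in FLOAT_COLUMNS:
        record[key] = float(record[key])
    return record


# Illustrative peak; -1 in the q-value column means "not assigned".
line = "chr1\t9356548\t9356648\t.\t0\t.\t182\t5.0945\t-1\t50"
peak = parse_narrowpeak(line)

# The 'peak' column is the summit offset from chromStart (0-based).
summit = peak["chromStart"] + peak["peak"]
print(peak["chrom"], summit, peak["pValue"])
```

A broadPeak line could be handled the same way by dropping the final `peak` column.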

Signal at genomic positions

  • Information line
    • Chromosome
    • Step size
    • Step start position
  • Score
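
The information line and scores described above match the fixedStep flavour of the Wiggle format, so the sketch below assumes that layout. It expands each score into an explicit (chromosome, start, end, score) record using 1-based positions, as Wiggle does. The function name and the example values are made up for illustration.

```python
def parse_fixed_step(lines):
    """Yield (chrom, start, end, score) for each score in a fixedStep block."""
    chrom = None
    position = step = None
    span = 1
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("track", "#")):
            continue  # skip blank, track and comment lines
        if line.startswith("fixedStep"):
            # Information line: key=value pairs after the keyword.
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            position = int(params["start"])
            step = int(params["step"])
            span = int(params.get("span", 1))
        else:
            # Score line: applies to 'span' bases starting at 'position'.
            yield chrom, position, position + span - 1, float(line)
            position += step


# Invented scores, for illustration only.
wig = ["fixedStep chrom=chr3 start=400601 step=100", "11", "22", "33"]
records = list(parse_fixed_step(wig))
for record in records:
    print(record)
```

Expanding scores into explicit intervals like this is essentially what a bedGraph file stores, except that bedGraph uses 0-based starts.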

Genomic Annotation.


Genomic Annotation

  • Chromosome
  • Start of feature
  • End of feature
  • Strand

  • Column 9 contains key=value pairs (e.g. ID=exon01), separated by semi-colons “;”.
  • ID - Feature name.
  • Parent - Name of the meta-feature to which the feature belongs.
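
Column 9 can be turned into a dictionary with a short helper. This is a sketch assuming GFF3-style `key=value` attributes; the function name and the example feature (including the parent name `mRNA01`) are hypothetical.

```python
def parse_gff3_line(line: str) -> dict:
    """Parse one GFF3 feature line, including the column 9 attributes."""
    cols = line.rstrip("\n").split("\t")
    attributes = {}
    for pair in cols[8].strip().split(";"):
        if pair:  # tolerate a trailing semi-colon
            key, _, value = pair.partition("=")
            attributes[key.strip()] = value.strip()
    return {
        "chrom": cols[0],
        "type": cols[2],
        "start": int(cols[3]),  # GFF coordinates are 1-based, inclusive
        "end": int(cols[4]),
        "strand": cols[6],
        "attributes": attributes,
    }


# Invented exon record, for illustration only.
line = "chr1\texample\texon\t1300\t1500\t.\t+\t.\tID=exon01;Parent=mRNA01"
feature = parse_gff3_line(line)
print(feature["attributes"]["ID"], "belongs to", feature["attributes"]["Parent"])
```

GTF files use a different attribute syntax (`key "value";`), so this helper would need adapting for them.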

Genomic Variants

  • Variant Call Format (VCF)
  • Mutation Annotation Format (MAF)

VCF Structure
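
As a rough guide to the structure, each VCF data line starts with eight fixed tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The sketch below parses those columns from one illustrative record; the helper name `parse_vcf_line` is hypothetical, and per-sample genotype columns are ignored.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line: str) -> dict:
    """Parse the eight fixed columns of one VCF data line."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, values[:8]))
    record["POS"] = int(record["POS"])       # 1-based position
    record["ALT"] = record["ALT"].split(",")  # several alternates allowed
    info = {}
    for item in record["INFO"].split(";"):
        if item in ("", "."):
            continue  # "." means no INFO entries
        key, sep, value = item.partition("=")
        # Keys without "=" are flags; store them as True.
        info[key] = value if sep else True
    record["INFO"] = info
    return record


# Illustrative variant record.
line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
variant = parse_vcf_line(line)
print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"])
print(variant["INFO"])
```

Lines beginning with “#” in a VCF file are header lines and would need to be skipped before parsing.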

Mutation Annotation Format (MAF)

Mutation Annotation Format (MAF) is a tab-delimited text file with aggregated mutation information from VCF files.
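
Because MAF is tab-delimited with a header row, it can be read with Python’s standard `csv` module. The sketch below counts mutations per gene. The column names follow common MAF usage, but the subset shown, the gene and sample names, and the `read_maf` helper are assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def read_maf(handle) -> list:
    """Read MAF rows as dictionaries, skipping '#' comment lines."""
    rows = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(rows, delimiter="\t"))


# Invented mutations with a small subset of MAF columns.
maf_text = (
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\tEnd_Position\t"
    "Variant_Classification\tReference_Allele\tTumor_Seq_Allele2\t"
    "Tumor_Sample_Barcode\n"
    "GENE_A\tchr1\t100\t100\tMissense_Mutation\tC\tT\tSAMPLE_1\n"
    "GENE_A\tchr1\t250\t250\tNonsense_Mutation\tG\tA\tSAMPLE_2\n"
    "GENE_B\tchr2\t500\t500\tSilent\tA\tG\tSAMPLE_1\n"
)

mutations = read_maf(io.StringIO(maf_text))
per_gene = Counter(row["Hugo_Symbol"] for row in mutations)
print(per_gene)
```

Real MAF files carry many more columns than shown here, so reading rows by column name rather than by position keeps code like this robust.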

Genomic Files for computing.